ABSTRACT

After a more than decade-long period of relatively little research ac- 
tivity in the area of recurrent neural networks, several new develop- 
ments will be reviewed here that have allowed substantial progress 
both in understanding and in technical solutions towards more effi- 
cient training of recurrent networks. These advances have been mo- 
tivated by and related to the optimization issues surrounding deep 
learning. Although recurrent networks are extremely powerful in 
what they can in principle represent in terms of modeling sequences, 
their training is plagued by two aspects of the same issue regarding 
the learning of long-term dependencies. Experiments reported here 
evaluate the use of clipping gradients, spanning longer time ranges 
with leaky integration, advanced momentum techniques, using more 
powerful output probability models, and encouraging sparser gra- 
dients to help symmetry breaking and credit assignment. The ex- 
periments are performed on text and music data and show off the 
combined effects of these techniques in generally improving both 
training and test error.

Index Terms — Recurrent networks, deep learning, representa- 
tion learning, long-term dependencies 

1. INTRODUCTION 

Machine learning algorithms for capturing statistical structure in
sequential data face a fundamental problem [1, 2], called the difficulty
of learning long-term dependencies. If the operations performed 
when forming a fixed-size summary of relevant past observations 
(for the purpose of predicting some future observations) are linear,
this summary must exponentially forget past events that are further 
away, to maintain stability. On the other hand, if they are non-linear, 
then this non-linearity is composed many times, yielding a highly 
non-linear relationship between past events and future events. Learn- 
ing such non-linear relationships turns out to be difficult, for reasons 
that are discussed here, along with recent proposals for reducing this 
difficulty. 

Recurrent neural networks [3] can represent such non-linear
maps (F, below) that iteratively build a relevant summary of past 
observations. In their simplest form, recurrent neural networks 
(RNNs) form a deterministic state variable $h_t$ as a function of the
present input observation $x_t$ and the past value(s) of the state
variable, e.g., $h_t = F_\theta(h_{t-1}, x_t)$, where $\theta$ are tunable
parameters that control what will be remembered about the past sequence
and what will be discarded. Depending on the type of problem at hand, a
loss function $L(h_t, y_t)$ is defined, with $y_t$ an observed random
variable at time $t$ and $C_t = L(h_t, y_t)$ the cost at time $t$. The
generalization objective is to minimize the expected future cost, and the
training objective involves the average of $C_t$ over observed sequences.
In principle, RNNs can be trained by gradient-based optimization
procedures (using the back-propagation algorithm [3] to compute a
gradient), but it was observed early on [1, 2] that capturing
dependencies that span a long interval was difficult, making the task of
optimizing $\theta$ to minimize the average of the $C_t$'s almost impossible
for some tasks when the span of the dependencies of interest increases
sufficiently. More precisely, using a local numerical optimization
method such as stochastic gradient descent or second-order methods (which
gradually improve the solution), the proportion of trials (differing
only in their random initialization) falling into the basin of
attraction of a good enough solution quickly becomes very small as
the temporal span of dependencies is increased (beyond tens or
hundreds of steps, depending on the task).

These difficulties are probably responsible for the major reduction
in research efforts in the area of RNNs in the 1990s and 2000s.
However, a revival of interest in these learning algorithms is taking
place, in particular thanks to [4] and [5]. This paper studies the
issues giving rise to these difficulties and discusses, reviews, and
combines several techniques that have been proposed in order to improve
training of RNNs, following up on a recent thesis devoted to the
subject [6]. We find that these techniques generally help generalization
performance as well as training performance, which suggests they
help to improve the optimization of the training criterion. We also
find that although these techniques can be applied in the online
setting, i.e., as add-ons to stochastic gradient descent (SGD), they allow
SGD to compete with batch (or large minibatch) second-order methods
such as Hessian-Free optimization, recently found to greatly help
training of RNNs [4].

2. LEARNING LONG-TERM DEPENDENCIES AND THE 
OPTIMIZATION DIFFICULTY WITH DEEP LEARNING 

There have been several breakthroughs in recent years in the
algorithms and results obtained with so-called deep learning algorithms
(see [7] and [8] for reviews). Deep learning algorithms discover
multiple levels of representation, typically as deep neural networks
or graphical models organized with many levels of
representation-carrying latent variables. Very little work on deep
architectures occurred before the major advances of 2006 [9, 10, 11],
probably because of optimization difficulties due to the high level of
non-linearity in deeper networks (whose output is the composition of
the non-linearity at each layer). Some experiments [12] showed the
presence of an extremely large number of apparent local minima of
the training criterion, with no two different initializations going to
the same function (i.e., eliminating the effect of permutations and
other symmetries of parametrization giving rise to the same
function). Furthermore, qualitatively different initialization (e.g., using
unsupervised learning) could yield models in completely different
regions of function space. An unresolved question is whether these
difficulties are actually due to local minima or to ill-conditioning
(which makes gradient descent converge so slowly as to appear
stuck in a local minimum). Some ill-conditioning has clearly been
shown to be involved, especially for the difficult problem of training
deep auto-encoders, through comparisons [13] of stochastic gradient
descent and Hessian-free optimization (a second-order optimization
method). These optimization questions become particularly
important when trying to train very large networks on very large
datasets [14], where one realizes that a major challenge for deep
learning is the underfitting issue. Of course one can trivially overfit
by increasing capacity in the wrong places (e.g., in the output layer),
but what we are trying to achieve is learning of more powerful
representations in order to also get good generalization.

The same questions can be asked for RNNs. When the computations
performed by an RNN are unfolded through time, one clearly
sees a deep neural network with shared weights (across the 'layers',
each corresponding to a different time step), and with a cost function
that may depend on the output of intermediate layers. Hessian-free
optimization has been successfully used to considerably extend the
span of temporal dependencies that an RNN can learn [4],
suggesting that ill-conditioning effects are also at play in the
difficulties of training RNNs.

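An important aspect of these difficulties is that the gradient can
be decomposed [2, 15] into terms that involve products of Jacobians
$\partial h_\tau / \partial h_{\tau-1}$ over subsequences linking an event
at time $t_1$ and one at time $t_2$:

$$\frac{\partial h_{t_2}}{\partial h_{t_1}} = \prod_{\tau=t_1+1}^{t_2} \frac{\partial h_\tau}{\partial h_{\tau-1}}$$

As $t_2 - t_1$ increases, the products of
$t_2 - t_1$ of these Jacobian matrices tend to either vanish (when the
leading eigenvalues of $\partial h_\tau / \partial h_{\tau-1}$ are less than 1)
or explode (when the leading eigenvalues of
$\partial h_\tau / \partial h_{\tau-1}$ are greater than 1)¹. This is
problematic because the total gradient due to a loss $C_{t_2}$ at time
$t_2$ is a sum whose terms correspond to the effects at different time
spans, which are weighted by $\partial h_{t_2} / \partial h_{t_1}$ for
different $t_1$'s:

$$\frac{\partial C_{t_2}}{\partial \theta} = \sum_{t_1 \le t_2} \frac{\partial C_{t_2}}{\partial h_{t_2}} \frac{\partial h_{t_2}}{\partial h_{t_1}} \frac{\partial h_{t_1}}{\partial \theta^{(t_1)}}$$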

where $\partial h_{t_1} / \partial \theta^{(t_1)}$ is the derivative of
$h_{t_1}$ with respect to the instantiation of the parameters $\theta$ at
step $t_1$, i.e., that directly come into the computation of $h_{t_1}$ in
$F$. When the $\partial h_{t_2} / \partial h_{t_1}$ tend to vanish for
increasing $t_2 - t_1$, the long-term effects become exponentially smaller
in magnitude than the shorter-term ones, making it very difficult to
capture them. On the other hand, when $\partial h_{t_2} / \partial h_{t_1}$
"explode" (become large), gradient descent updates can be destructive
(move to a poor configuration of parameters). It is not that the gradient
is wrong; it is that gradient descent makes small but finite steps
$\Delta\theta$ yielding a $\Delta C$, whereas the gradient measures the
effect on $\Delta C$ when $\Delta\theta \to 0$. A
much deeper discussion of this issue can be found in [15], along with
a point of view inspired by dynamical systems theory and by the
geometrical aspect of the problem, having to do with the shape of the
training criterion as a function of $\theta$ near those regions of exploding
gradient. In particular, it is argued that the strong non-linearity
occurring where gradients explode is shaped like a cliff where not just
the first but also the second derivative becomes large in the
direction orthogonal to the cliff. Similarly, flatness of the cost function
occurs simultaneously on the first and second derivatives. Hence
dividing the gradient by the second derivative in each direction (i.e.,
pre-multiplying by the inverse of some proxy for the Hessian
matrix) could in principle reduce the exploding and vanishing gradient
effects, as argued in [4].
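
The vanishing and exploding regimes can be illustrated numerically. The following is a minimal sketch, not taken from the experiments reported here: it assumes the simplest linear recurrence, in which every Jacobian equals the same matrix W, so the product over a span of steps is a matrix power. The helper name `jacobian_product_norm` and the choice of a scaled orthogonal W are illustrative assumptions.

```python
import numpy as np

def jacobian_product_norm(W, span):
    """Spectral norm of the product of `span` identical Jacobians W."""
    product = np.linalg.matrix_power(W, span)
    return np.linalg.norm(product, 2)

rng = np.random.default_rng(0)
# Random orthogonal matrix: all of its singular values equal 1.
Q, _ = np.linalg.qr(rng.standard_normal((8, 8)))

contracting = 0.9 * Q   # leading eigenvalue magnitude 0.9 < 1
expanding = 1.1 * Q     # leading eigenvalue magnitude 1.1 > 1

# Print the norm of the Jacobian product for increasing time spans.
for span in (1, 10, 50, 100):
    print(span,
          jacobian_product_norm(contracting, span),
          jacobian_product_norm(expanding, span))
```

In this toy setting the norm scales as 0.9 or 1.1 raised to the span, so the contribution of an event 100 steps back is either negligible or dominant compared with that of a recent event.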



¹ Note that this is not a sufficient condition, but a necessary one.
Furthermore, one usually wants to operate in the regime where the leading
eigenvalue is larger than 1 but the gradients do not explode.



3. ADVANCES IN TRAINING RECURRENT NETWORKS 
3.1. Clipped Gradient 

To address the exploding gradient effect, [16, 15] recently proposed
to clip gradients above a given threshold. Under the hypothesis that
the explosion occurs in very small regions (the cliffs in the cost
function mentioned above), most of the time this will have no effect, but
it will avoid aberrant parameter changes in those cliff regions, while
guaranteeing that the resulting updates are still in a descent direction.
The specific form of clipping used here was proposed in [15] and is
discussed there at much greater length: when the norm of the gradient
vector $g$ for a given sequence is above a threshold, the update
is done in the direction $\text{threshold} \cdot g / \|g\|$. As argued in
[15], this very simple method implements a very simple form of
second-order optimization in the sense that the second derivative is also
proportionally large in those exploding gradient regions.
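
The clipping rule described above can be sketched in a few lines. This is a minimal illustration rather than the exact code used in the experiments; the function name `clip_gradient_norm` and the toy gradient are assumptions, while the threshold of 15 is the value reported for the music datasets in Section 4.1.1.

```python
import numpy as np

def clip_gradient_norm(g, threshold):
    """Rescale g to norm `threshold` when its norm exceeds the threshold."""
    norm = np.linalg.norm(g)
    if norm > threshold:
        return threshold * g / norm
    return g

g = np.array([30.0, 40.0])              # toy gradient with norm 50
clipped = clip_gradient_norm(g, 15.0)   # same direction, norm 15
print(clipped, np.linalg.norm(clipped))
```

Because only the norm is rescaled, the clipped update keeps the direction of the gradient, which is why it remains a descent direction.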



3.2. Spanning Longer Time Ranges with Leaky Integration 

An old idea to reduce the effect of vanishing gradients is to introduce
shorter paths between $t_1$ and $t_2$, either via connections with
longer time delays [17] or inertia (slow-changing units) in some of
the hidden units [18, 19], or both [20]. Long Short-Term Memory
(LSTM) networks [21], which were shown to be able to handle
much longer range dependencies, also benefit from a linearly
self-connected memory unit with a near-1 self-weight, which allows
signals (and gradients) to propagate over long time spans.

A different interpretation of these slow-changing units is that they
behave like low-pass filters and hence can be used to focus
certain units on different frequency regions of the data. The analogy
can be brought one step further by introducing band-pass filter units
[22] or by using domain-specific knowledge to decide which
frequency bands different units should focus on. [23] shows that adding
low-frequency information as an additional input to a recurrent
network helps improve the performance of the model.

In the experiments performed here, a subset of the units were
forced to change slowly by using the following "leaky integration"
state-to-state map:
$h_{t,i} = \alpha_i h_{t-1,i} + (1 - \alpha_i) F_i(h_{t-1}, x_t)$. The
standard RNN corresponds to $\alpha_i = 0$, while here different values
of $\alpha_i$ were randomly sampled from (0.02, 0.2), allowing some units
to react quickly while others are forced to change slowly, but also
propagate signals and gradients further in time. Note that because
$\alpha < 1$, the vanishing effect is still present (and gradients can still
explode via $F$), but the time-scale of the vanishing effect can be
expanded.
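
The leaky-integration map can be sketched as follows. The form of F is not fixed by the text, so this sketch assumes a single tanh layer; the names `leaky_step`, `W`, `U` and `b`, the layer sizes and the choice of making half of the units leaky are all illustrative assumptions.

```python
import numpy as np

def leaky_step(h_prev, x, W, U, b, alpha):
    """One update h_t = alpha * h_{t-1} + (1 - alpha) * F(h_{t-1}, x_t)."""
    candidate = np.tanh(W @ h_prev + U @ x + b)   # F assumed to be a tanh layer
    return alpha * h_prev + (1.0 - alpha) * candidate

rng = np.random.default_rng(0)
n_hidden, n_input = 6, 3
W = 0.1 * rng.standard_normal((n_hidden, n_hidden))
U = 0.1 * rng.standard_normal((n_hidden, n_input))
b = np.zeros(n_hidden)

# Per-unit leak factors; alpha_i = 0 recovers the standard RNN for that unit.
alpha = np.zeros(n_hidden)
alpha[: n_hidden // 2] = rng.uniform(0.02, 0.2, size=n_hidden // 2)

h = np.zeros(n_hidden)
for x in rng.standard_normal((5, n_input)):
    h = leaky_step(h, x, W, U, b, alpha)
print(h)
```

Since each unit has its own factor, a single hidden layer can mix units that track the input closely with units that retain more of their previous value.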



3.3. Combining Recurrent Nets with a Powerful Output Proba- 
bility Model 

One way to reduce the underfitting of RNNs is to introduce multi- 
plicative interactions in the parametrization of F, as was done suc- 
cessfully in O. When the output predictions are multivariate, an- 
other approach is to capture the high-order dependencies between 
the output variables using a powerful output probability model such 
as a Restricted Boltzmann Machine (RBM) [24, 25] or a deterministic
variant of it called NADE [26, 25]. In the experiments performed
here, we have experimented with a NADE output model for the music
data.



3.4. Sparser Gradients via Sparse Output Regularization and 
Rectified Outputs 

[7] hypothesized that one reason for the difficulty in optimizing
deep networks is that in ordinary neural networks gradients diffuse
through the layers, diffusing credit and blame through many units,
maybe making it difficult for hidden units to specialize. When
the gradient on hidden units is more sparse, one could imagine
that symmetries would be broken more easily and credit or blame
assigned less uniformly. This is what was advocated in [27],
exploiting the idea of rectifier non-linearities introduced earlier in
[28], i.e., the neuron non-linearity is out = max(0, in) instead
of out = tanh(in) or out = sigmoid(in). This approach was
very successful in recent work on deep learning for object
recognition [29], beating by far the state-of-the-art on ImageNet (1000
classes). Here, we apply this deep learning idea to RNNs, using
an L1 penalty on outputs of hidden units to promote sparsity of
activations. The underlying hypothesis is that if the gradient is
concentrated in a few paths (in the unfolded computation graph of the
RNN), it will reduce the vanishing gradients effect.
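
A minimal sketch of the rectifier non-linearity combined with an L1 penalty on hidden outputs is given below. The function names, the toy pre-activations and the coefficient 1e-3 are illustrative assumptions, not values from the experiments.

```python
import numpy as np

def rectifier(pre_activation):
    """Rectifier non-linearity: out = max(0, in)."""
    return np.maximum(0.0, pre_activation)

def l1_activation_penalty(hidden, coefficient):
    """L1 penalty on hidden-unit outputs and its gradient w.r.t. those outputs."""
    penalty = coefficient * np.abs(hidden).sum()
    grad = coefficient * np.sign(hidden)
    return penalty, grad

pre = np.array([-1.5, 0.3, -0.2, 2.0])   # toy pre-activations
h = rectifier(pre)
penalty, grad_h = l1_activation_penalty(h, coefficient=1e-3)  # hypothetical coefficient
# Back-propagating through the rectifier zeroes the gradient of inactive units.
grad_pre = grad_h * (pre > 0)
print(h, penalty, grad_pre)
```

Units whose input is negative output exactly zero and receive no gradient, so only the active units carry credit or blame backwards.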

3.5. Simplified Nesterov Momentum

Nesterov accelerated gradient (NAG) [30] is a first-order optimization
method to improve stability and convergence of regular gradient
descent. Recently, [6] showed that NAG could be computed by the
following update rules:

$$v_t = \mu_{t-1} v_{t-1} - \epsilon_{t-1} \nabla f(\theta_{t-1} + \mu_{t-1} v_{t-1}) \qquad (1)$$
$$\theta_t = \theta_{t-1} + v_t \qquad (2)$$

where $\theta_t$ are the model parameters, $v_t$ the velocity,
$\mu_t \in [0, 1]$ the momentum (decay) coefficient and $\epsilon_t > 0$ the
learning rate at iteration $t$, $f(\theta)$ is the objective function and
$\nabla f(\theta')$ is a shorthand notation for the gradient
$\frac{\partial f(\theta)}{\partial \theta}\big|_{\theta=\theta'}$. These
equations have a form similar to standard momentum updates:

$$v_t = \mu_{t-1} v_{t-1} - \epsilon_{t-1} \nabla f(\theta_{t-1}) \qquad (3)$$
$$\theta_t = \theta_{t-1} + v_t \qquad (4)$$
$$\phantom{\theta_t} = \theta_{t-1} + \mu_{t-1} v_{t-1} - \epsilon_{t-1} \nabla f(\theta_{t-1}) \qquad (5)$$

and differ only in the evaluation point of the gradient at each
iteration. This important difference, thought to counterbalance too high
velocities by "peeking ahead" at actual objective values in the candidate
search direction, results in significantly improved RNN
performance on a number of tasks.

In this section, we derive a new formulation of Nesterov
momentum differing from (3) and (5) only in the linear combination
coefficients of the velocity and gradient contributions at each
iteration, and we offer an alternative interpretation of the method. The
key departure from (1) and (2) resides in committing to the
"peeked-ahead" parameters
$\Theta_{t-1} \equiv \theta_{t-1} + \mu_{t-1} v_{t-1}$ and backtracking by
the same amount before each update. Our new parameters $\Theta_t$
updates become:

$$v_t = \mu_{t-1} v_{t-1} - \epsilon_{t-1} \nabla f(\Theta_{t-1}) \qquad (6)$$
$$\Theta_t = \Theta_{t-1} - \mu_{t-1} v_{t-1} + \mu_t v_t + v_t$$
$$\phantom{\Theta_t} = \Theta_{t-1} + \mu_t \mu_{t-1} v_{t-1} - (1 + \mu_t) \epsilon_{t-1} \nabla f(\Theta_{t-1}) \qquad (7)$$

Assuming a zero initial velocity $v_1 = 0$ and velocity at convergence
of optimization $v_T \approx 0$, the parameters $\Theta$ are a completely
equivalent replacement of $\theta$.



Note that equation (7) is identical to regular momentum (5)
with different linear combination coefficients. More precisely, for an
equivalent velocity update (6), the velocity contribution to the new
parameters $\mu_t \mu_{t-1} < \mu_{t-1}$ is reduced relative to the gradient
contribution $(1 + \mu_t)\epsilon_{t-1} > \epsilon_{t-1}$. This allows
storing past velocities for a longer time with a higher $\mu$, while
actually using those velocities more conservatively during the updates.
We suspect this mechanism is a crucial ingredient for good empirical
performance. While
the "peeking ahead" point of view suggests that a similar strategy
could be adapted for regular gradient descent (misleadingly, because
it would amount to a reduced learning rate $\epsilon_t$), our derivation
shows why it is important to choose search directions aligned with the
current velocity to yield substantial improvement. The general case is
also simpler to implement.
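
As a sanity check on the derivation, the following sketch runs the original updates (1)-(2) and the reformulated updates (6)-(7) side by side and verifies numerically that the two parameter sequences differ exactly by the momentum-scaled velocity. The toy quadratic objective, the constant values of the momentum and learning rate, and all variable names are illustrative assumptions, not settings used in the experiments.

```python
import numpy as np

A = np.diag([1.0, 10.0])       # toy ill-conditioned quadratic

def grad_f(theta):
    """Gradient of the toy objective f(theta) = 0.5 * theta^T A theta."""
    return A @ theta

mu, eps = 0.9, 0.05            # constant schedules, chosen for illustration
theta0 = np.array([1.0, 1.0])

# Original form, equations (1)-(2).
theta, v = theta0.copy(), np.zeros(2)
# Reformulated form, equations (6)-(7); zero initial velocity gives Theta_0 = theta_0.
Theta, V = theta0.copy(), np.zeros(2)

for _ in range(50):
    v = mu * v - eps * grad_f(theta + mu * v)             # (1)
    theta = theta + v                                     # (2)

    g = grad_f(Theta)
    V_new = mu * V - eps * g                              # (6)
    Theta = Theta + mu * mu * V - (1.0 + mu) * eps * g    # (7)
    V = V_new

# The two trajectories are related by Theta_t = theta_t + mu * v_t.
print(np.max(np.abs(Theta - (theta + mu * v))))
```

The reformulated loop evaluates the gradient at the stored parameters directly, with no separate look-ahead point to construct, which is the sense in which it is simpler to implement.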

4. EXPERIMENTS 

In the experimental section we compare vanilla SGD versus SGD
plus some of the enhancements discussed above. Specifically, we
use the letter 'C' to indicate that gradient clipping is used, 'L' for
leaky-integration units, 'R' if we use rectifier units with L1 penalty
and 'M' for Nesterov momentum.

4.1. Music Data 

We evaluate our models on the four polyphonic music datasets of
varying complexity used in [25]: classical piano music
(Piano-midi.de), folk tunes with chords instantiated from ABC
notation (Nottingham), orchestral music (MuseData) and the four-part
chorales by J.S. Bach (JSB chorales). The symbolic sequences
contain high-level pitch and timing information in the form of a binary
matrix, or piano-roll, specifying precisely which notes occur at each
time-step. They form interesting benchmarks for RNNs because of
their high dimensionality and the complex temporal dependencies
involved at different time scales. Each dataset contains at least 7
hours of polyphonic music with an average polyphony (number of
simultaneous notes) of 3.9.

Piano-rolls were prepared by aligning each time-step (88 pitch
labels that cover the whole range of piano) on an integer fraction
of the beat (quarter note) and transposing each sequence to a common
tonality (C major/minor) to facilitate learning. Source files and
preprocessed piano-rolls split in train, validation and test sets are
available on the authors' website².

4.1.1. Setup and Results 

We select hyperparameters, such as the number of hidden units $n_h$,
regularization coefficients $\lambda_{L1}$, the choice of non-linearity
function, or the momentum schedule $\mu_t$, learning rate $\epsilon_t$,
number of leaky units $n_{leaky}$ or leaky factors $\alpha$ according to
log-likelihood on a validation set and we report the final performance on
the test set for the best choice in each category. We do so by using
random search [31] on the following intervals:

$n_h \in [100, 400]$                      $\epsilon_t \in [10^{-?}, 10^{-?}]$

$\mu_t \in [10^{-?}, 0.95]$               $\lambda_{L1} \in [10^{-?}, 10^{-?}]$

$n_{leaky} \in \{0\%, 25\%, 50\%\}$       $\alpha \in [0.02, 2]$

The cutoff threshold for gradient clipping is set based on the
average norm of the gradient over one pass on the data, and we used
15 in this case for all music datasets. The data is split into sequences
of 100 steps over which we compute the gradient. The hidden state
is carried over from one sequence to another if they belong to the
same song, otherwise it is reset to 0.

² www-etud.iro.umontreal.ca/~boulanni/icml2012

Table 1. Log-likelihood (LL) and expected accuracy (ACC%) for various RNN models in the symbolic music prediction task. The double line separates
sigmoid recognition layers (above) from structured output probability models (below).

Model               |      Piano-midi.de      |       Nottingham        |        MuseData         |      JSB chorales
                    |      LL      LL    ACC% |      LL      LL    ACC% |      LL      LL    ACC% |      LL      LL    ACC%
                    | (train)  (test)  (test) | (train)  (test)  (test) | (train)  (test)  (test) | (train)  (test)  (test)
--------------------+-------------------------+-------------------------+-------------------------+-------------------------
RNN (SGD)           |   -7.10   -7.86   22.84 |   -3.49   -3.75   66.90 |   -6.93   -7.20   27.97 |   -7.88   -8.65   29.97
RNN (SGD+C)         |   -7.15   -7.59   22.98 |   -3.40   -3.67   67.47 |   -6.79   -7.04   30.53 |   -7.81   -8.65   29.98
RNN (SGD+CL)        |   -7.04   -7.57   22.97 |   -3.31   -3.57   67.97 |   -6.47   -6.99   31.53 |   -7.78   -8.63   29.98
RNN (SGD+CLR)       |   -6.40   -7.80   24.22 |   -2.99   -3.55   70.20 |   -6.70   -7.34   29.06 |   -7.67   -9.47   29.98
RNN (SGD+CRM)       |   -6.92   -7.73   23.71 |   -3.20   -3.43   68.47 |   -7.01   -7.24   29.13 |   -8.08   -8.81   29.52
RNN (HF)            |   -7.00   -7.58   22.93 |   -3.47   -3.76   66.71 |   -6.76   -7.12   29.77 |   -8.11   -8.58   29.41
====================+=========================+=========================+=========================+=========================
RNN-RBM             |     N/A   -7.09   28.92 |     N/A   -2.39   75.40 |     N/A   -6.01   34.02 |     N/A   -6.27   33.12
RNN-NADE (SGD)      |   -7.23   -7.48   20.69 |   -2.85   -2.91   64.95 |   -6.86   -6.74   24.91 |   -5.46   -5.83   32.11
RNN-NADE (SGD+CR)   |   -6.70   -7.34   21.22 |   -2.14   -2.51   69.80 |   -6.27   -6.37   26.60 |   -4.44   -5.33   34.52
RNN-NADE (SGD+CRM)  |   -6.61   -7.34   22.12 |   -2.11   -2.49   69.54 |   -5.99   -6.19   29.62 |   -4.26   -5.19   35.08
RNN-NADE (HF)       |   -6.32   -7.05   23.42 |   -1.81   -2.31   71.50 |   -5.20   -5.60   32.60 |   -4.91   -5.56   32.50

Table 2. Entropy (bits per character) and perplexity for various RNN models on next character and next word prediction tasks.

Model               | Penn Treebank Corpus  | Penn Treebank Corpus
                    |      word level       |    character level
                    | perplexity perplexity |    entropy    entropy
                    |    (train)     (test) |    (train)     (test)
--------------------+-----------------------+-----------------------
RNN (SGD)           |     112.11     145.16 |       1.78       1.76
RNN (SGD+C)         |      78.71     136.63 |       1.40       1.44
RNN (SGD+CL)        |      76.70     129.83 |       1.56       1.56
RNN (SGD+CLR)       |      75.45     128.35 |       1.45       1.49

Table 1 presents log-likelihood (LL) and expected frame-level
accuracy for various RNNs in the symbolic music prediction task.

Results clearly show that these enhancements improve
on regular SGD in almost all cases; they also make SGD competitive
with HF for the RNNs with sigmoid recognition layers.

4.2. Text Data 

We use the Penn Treebank Corpus to explore both word and char- 
acter prediction tasks. The data is split by using sections 0-20 as 
training data (5017k characters), sections 21-22 as validation (393k 
characters) and sections 23-24 as test data (442k characters). 

For the word level prediction, we fix the dictionary to 10000 
words, which we divide into 30 classes according to their frequency 
in text (each class holding approximately 3.3% of the total number 
of tokens in the training set). Such a factorization allows for faster 
implementation, as we are not required to evaluate the whole output 
layer (10000 units) which is the computational bottleneck, but only 
the output of the corresponding class [32].
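
A class-based output layer of this kind is usually written as P(w | h) = P(c(w) | h) P(w | c(w), h), where c(w) is the class of word w. The sketch below shows one way frequency classes of roughly equal token mass could be assigned and how many output units are then evaluated per prediction. It is an illustration under stated assumptions, not the procedure of [32]: the function name `assign_frequency_classes`, the toy Zipf-like counts and the target word index are all hypothetical.

```python
import numpy as np

def assign_frequency_classes(counts, n_classes):
    """Assign each word a class so that classes cover roughly equal token mass."""
    counts = np.asarray(counts, dtype=float)
    order = np.argsort(-counts, kind="stable")        # most frequent first
    # Fraction of all tokens accounted for by strictly more frequent words.
    mass_before = (np.cumsum(counts[order]) - counts[order]) / counts.sum()
    classes_sorted = np.floor(mass_before * n_classes).astype(int)
    classes = np.empty(len(counts), dtype=int)
    classes[order] = classes_sorted
    return classes

# Toy Zipf-like vocabulary (hypothetical counts, not Penn Treebank statistics).
vocab_size, n_classes = 10000, 30
counts = 1e6 / np.arange(1, vocab_size + 1)
classes = assign_frequency_classes(counts, n_classes)
class_sizes = np.bincount(classes, minlength=n_classes)

# Units evaluated for one prediction: the class layer plus the words of one class.
target_word = 1234
units_evaluated = n_classes + class_sizes[classes[target_word]]
print(class_sizes, units_evaluated, vocab_size)
```

With such skewed counts a single very frequent word can exceed one class's share of the token mass, so some class indices may stay empty; the saving comes from evaluating only the class layer and the words of the target class instead of all 10000 outputs.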

4.2.1. Setup and Results 

In the case of next word prediction, we compute gradients over
sequences of 40 steps, where we carry the hidden state from one
sequence to another. We use a small grid-search around the parameters
used to get state-of-the-art results for this number of classes [32], i.e.,
with a network of 200 hidden units yielding a perplexity of 134. We
explore learning rates of 0.1, 0.01 and 0.001, rectifier units versus
sigmoid units, cutoff thresholds for the gradients of 30, 50 or none, and
no leaky units versus 50% of the units being leaky, with leaky factors
sampled between 0.02 and 0.2.

For the character level model we compute gradients over se- 
quences of 150 steps, as we assume that longer dependencies are 
more crucial in this case. We use 500 hidden units and explore learn- 
ing rates of 0.5, 0.1 and 0.01. 

Table 2 reports entropy (bits per character) or perplexity for
various RNNs on the word and character prediction tasks. Again, we
observe substantial improvements in both training and test
perplexity, suggesting that these techniques make optimization easier.



5. CONCLUSIONS 

Through our experiments we provide evidence that part of the issue
of training RNNs is due to the rough error surface, which cannot be
easily handled by SGD. We follow an incremental set of improvements
to SGD, and show that in most cases they improve both the
training and test error, and allow this enhanced SGD to compete with or
even improve on a second-order method which was found to work
particularly well for RNNs, i.e., Hessian-Free optimization.